✨ [file-based cdk] S3 file format adapter #29353

Merged Aug 14, 2023 (24 commits)

brianjlai
Contributor

Closes #29292

What

Extends the existing S3 adapter to transform incoming legacy configs into a config that can be used by the v4 connector.

How

Since I don't think any of the other file formats need the adapter, I went ahead and added tests plus some transformation logic that drops the fields the v4 connector doesn't actually use.
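
For readers skimming the thread, here is a rough sketch of the kind of transformation described above, pieced together from the snippets quoted in the review comments further down. The two dataclasses are hypothetical stand-ins for the connector's real format classes, the function name is made up for illustration, and the `"filetype": "csv"` entry is an assumption by analogy with the Avro branch; the actual adapter may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass
class AvroFormat:
    """Hypothetical stand-in for the legacy Avro format options."""


@dataclass
class CsvFormat:
    """Hypothetical stand-in for the legacy CSV format options."""

    infer_datatypes: bool = True


def transform_file_format(format_options: Union[AvroFormat, CsvFormat]) -> Dict[str, Any]:
    """Turn legacy format options into a dict shaped like the v4 format config."""
    if isinstance(format_options, AvroFormat):
        return {"filetype": "avro"}
    if isinstance(format_options, CsvFormat):
        return {
            "filetype": "csv",  # assumption: mirrors the Avro branch
            "true_values": ["y", "yes", "t", "true", "on", "1"],
            "false_values": ["n", "no", "f", "false", "off", "0"],
            "inference_type": "Primitive Types Only" if format_options.infer_datatypes else "None",
        }
    raise ValueError(f"Unsupported legacy format: {type(format_options).__name__}")
```

Anything not copied into the returned dict is dropped, which is how unused legacy fields would fall away in this sketch.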

@octavia-squidington-iii octavia-squidington-iii added the area/connectors (Connector related issues), CDK (Connector Development Kit), and connectors/source/s3 labels Aug 11, 2023
@github-actions
Contributor

github-actions bot commented Aug 11, 2023

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@brianjlai brianjlai changed the base branch from master to issue-28893/infer-schema-csv August 11, 2023 02:03
@brianjlai brianjlai changed the title [file-based cdk] S3 file format adapter ✨ [file-based cdk] S3 file format adapter Aug 11, 2023
@brianjlai brianjlai marked this pull request as ready for review August 11, 2023 05:56
@octavia-squidington-iii octavia-squidington-iii removed the CDK Connector Development Kit label Aug 11, 2023
@brianjlai brianjlai requested a review from a team August 11, 2023 05:56
@brianjlai brianjlai requested review from maxi297 and clnoll August 11, 2023 05:57
@octavia-squidington-iii
Collaborator

source-s3 test report (commit 2f827c0334) - ❌

⏲️ Total pipeline duration: 22mn17s

Step Result
Validate airbyte-integrations/connectors/source-s3/metadata.yaml
Connector version semver check
Connector version increment check
QA checks
Code format checks
Connector package install
Build source-s3 docker image for platform linux/x86_64
Unit tests
Integration tests
Acceptance tests

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=source-s3 test

@octavia-squidington-iii
Collaborator

source-s3 test report (commit 2f827c0334) - ❌

⏲️ Total pipeline duration: 20mn18s

Contributor

@girarda girarda left a comment

Looks good! I forgot about the CsvSpec.advanced_options when we groomed the issue. We should forward them to the file format too

if isinstance(format_options, AvroFormat):
    return {"filetype": "avro"}
elif isinstance(format_options, CsvFormat):
    csv_options = {
Contributor

We should also pass the options that are hidden in the CsvSpec.advanced_options

  • advanced_options["skip_rows"] -> skip_rows_before_header
  • advanced_options["skip_rows_after_names"] -> skip_rows_after_header
  • advanced_options["autogenerate_column_names"] -> autogenerate_column_names
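
The renames listed above could be sketched roughly as follows. This is only an illustration: the helper name and the mapping constant are hypothetical, and it assumes `advanced_options` has already been parsed into a dict.

```python
from typing import Any, Dict, Mapping

# Legacy advanced_options key -> v4 CSV format field, as listed in the comment above.
ADVANCED_OPTIONS_TO_V4 = {
    "skip_rows": "skip_rows_before_header",
    "skip_rows_after_names": "skip_rows_after_header",
    "autogenerate_column_names": "autogenerate_column_names",
}


def forward_advanced_options(advanced_options: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: copy only the recognised keys, renamed for v4."""
    return {
        v4_key: advanced_options[legacy_key]
        for legacy_key, v4_key in ADVANCED_OPTIONS_TO_V4.items()
        if legacy_key in advanced_options
    }
```

Keys that are not in the mapping are ignored here, so the result can be merged straight into the CSV options dict.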

Contributor Author

got it. will add!

"true_values": ["y", "yes", "t", "true", "on", "1"],
"false_values": ["n", "no", "f", "false", "off", "0"],
"inference_type": "Primitive Types Only" if format_options.infer_datatypes else "None",
"strings_can_be_null": True,
Contributor

strings_can_be_null should be set from CsvSpec.additional_reader_options["strings_can_be_null"], or default to false if it is absent. See https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions

Contributor Author

Wait, so just to confirm: we want it to default to false when it's not in the additional options?

The PR description we talked about during planning said: strings_can_be_null -> default to True in legacy

Contributor

Yeah, I must have derped when we groomed the issue.
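
The outcome of this thread (read the flag from the additional reader options, otherwise fall back to false) might look something like the sketch below. The function name is hypothetical and it assumes `additional_reader_options` is already a parsed dict; the `False` fallback matches the pyarrow `ConvertOptions` default linked above.

```python
from typing import Any, Mapping


def resolve_strings_can_be_null(additional_reader_options: Mapping[str, Any]) -> bool:
    """Hypothetical helper: use the legacy value when present, else default to False."""
    return bool(additional_reader_options.get("strings_can_be_null", False))
```

With this shape, a legacy config that never set the option ends up with `strings_can_be_null` disabled, and one that set it explicitly keeps its value.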

@octavia-squidington-iii
Collaborator

source-s3 test report (commit 868a597e3f) - ❌

⏲️ Total pipeline duration: 11mn05s

Contributor

@girarda girarda left a comment

:shipit:

Base automatically changed from issue-28893/infer-schema-csv to master August 14, 2023 19:14
@octavia-squidington-iii octavia-squidington-iii added the CDK Connector Development Kit label Aug 14, 2023
@octavia-squidington-iii octavia-squidington-iii removed the CDK Connector Development Kit label Aug 14, 2023
@octavia-squidington-iii
Collaborator

source-s3 test report (commit e48b9d527b) - ❌

⏲️ Total pipeline duration: 13mn39s


@octavia-squidington-iii
Collaborator

source-s3 test report (commit bbfbd80cc6) - ❌

⏲️ Total pipeline duration: 12mn01s


@brianjlai brianjlai merged commit 82b8274 into master Aug 14, 2023
@brianjlai brianjlai deleted the brian/s3_csv_format_adapter branch August 14, 2023 22:47
harrytou pushed a commit to KYVENetwork/airbyte that referenced this pull request Sep 1, 2023
* [ISSUE airbytehq#28893] infer csv schema

* [ISSUE airbytehq#28893] align with pyarrow

* Automated Commit - Formatting Changes

* [ISSUE airbytehq#28893] legacy inference and infer only when needed

* [ISSUE airbytehq#28893] fix scenario tests

* [ISSUE airbytehq#28893] using discovered schema as part of read

* [ISSUE airbytehq#28893] self-review + cleanup

* [ISSUE airbytehq#28893] fix test

* [ISSUE airbytehq#28893] code review part #1

* [ISSUE airbytehq#28893] code review part #2

* Fix test

* formatcdk

* [ISSUE airbytehq#28893] code review

* FIX test log level

* Re-adding failing tests

* [ISSUE airbytehq#28893] improve inferrence to consider multiple types per value

* Automated Commit - Formatting Changes

* add file adapters for avro, csv, jsonl, and parquet

* fix try catch

* pr feedback with a few additional default options set

* fix things from the rebase of master

---------

Co-authored-by: maxi297 <maxime@airbyte.io>
Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

[file-based cdk] CSV format config adapter
4 participants